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Abstract 

We present a graph-based technique for estimating sparse covariance ma- 
trices and their inverses from high-dimensional data. The method is based 
on learning a directed acyclic graph (DAG) and estimating parameters of a 
multivariate Gaussian distribution based on a DAG. For inferring the under- 
lying DAG we use the PC-algorithm [2Z] and for estimating the D AG-based 
covariance matrix and its inverse, we use a Cholesky decomposition ap- 
proach which provides a positive (semi-)definite sparse estimate. We present 
a consistency result in the high-dimensional framework and we compare our 
method with the Glasso M, Q] for simulated and real data. 



1 Introduction 

Estimation of covariance matrices is an important part of multivariate analysis. There 
are many problems with high-dimensional data where an estimation of the covariance 
matrix is of interest, for example in principal component analysis or classification by 
discriminant analysis. Application areas where such problems arise include gene micro- 
arrays, imaging and image classification or text retrieval. In many of these applications, 
the primary goal is the estimation of the inverse of a covariance matrix S^^, also known 
as the precision or concentration matrix, rather than the covariance S itself. In low- 
dimensional settings with p < n, where p denotes the row- or column-dimension of T, 
and n the sample size, we can obtain an estimate of by estimation and inversion of 
the Gaussian maximum likelihood estimator T,mle- But when p is large, inversion of 
this estimate is problematic and its accuracy is very poor. 

Recently, two classes for high-dimensional covariance estimation have emerged: those 
that rely on a natural ordering among variables and typically assuming that variables 
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far apart in the ordering are only weakly correlated, and those which are invariant 
to variable permutation. Regularized estimation by banding or tapering 13, 0] or 



using sparse Cholesky factors of the inverse covariance matrix relying on the natural 



ordering of the variables [3l|, [ij, ll9|] are members of the first class of covariance esti- 
mators. When having no natural ordering among the variables, estimators should be 
permutation invariant with respect to indexing the variables. A popular approach to 
obtain a sparse permutation-invariant estimate is to add a Lasso penalty on the entries 
of the concentration matrix to the negative Gaussian log-likelihood |l2l . 

SSH- This 



amounts to shrinking some of the elements of the inverse covariance matrix exactly to 
zero. Alternatively, the Lasso can be used for inferring an undirected conditional in- 



dependence graph using node-wise regressions [2j] and a covariance estimate can then 
be obtained using the structure of the graph. Other approaches include a simple hard- 
thresholding of the elements of the unpenalized maximum likelihood estimator 0], with 
the disadvantage that the resulting estimate is not necessarily positive (semi-) definite. 

The method which we present here is also invariant under permutation of the variables. 
The type of regularization which we pursue is based on exploiting a sparse graphical 
model structure first and then estimating the covariance matrix and its inverse using 
non-regularized estimation. Because of the sparsity of the graphical model structure, 
the second step does not need any regularization anymore. More precisely, we use a 
sparsely structured Cholesky decomposition of the concentration matrix for estimation 
of the covariance and concentration matrix. To obtain the structure of such a Cholesky 
factor, we estimate a DAG (in fact, an equivalence class of DAGs). Thus, this approach 
enforces a completely different sparsity structure on the Cholesky factor than proposals 
for ordered data as in e.g. [1, [ij, H, [§]. 

For a given DAG, our approach equals the iterative conditional fitting (ICF) method 



presented in [10|, |6[ which reduces here to the standard technique of fitting Gaussian 



DAG models. Our contribution is to use an estimated DAG, i.e. an estimated equiv- 



alence class of DAGs from the PC-algorithm [271], and to analyze the method in the 
high-dimensional case taking the uncertainty of structure estimation of the equivalence 
class of DAGs into account. We argue in this paper that within the class of methods 
which are invariant under variable permutation, a graph-structured approach can be 
worthwhile for a range of scenarios, sometimes resulting in performance gains up to 
30-50% over shrinkage methods. 

In Section[2]we give a brief overview over graph terminology and graphical models. Sec- 
tion [3] introduces our methodology and we show asymptotic consistency of the method 
in the high-dimensional framework in Sectional Simulations and real data examples are 
presented in Section [5] and we propose a robustified version of our procedure in Section 

El 
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2 Graph terminology and graphical models 



2.1 Graphs 

Let G = (y, E) be a graph with a set of vertices V and a set of edges E V x V . In 
our context, we use V = {1, . . . ,p} corresponding to some random variables Xi, ...,Xp. 

A graph can be directed, undirected or partially directed. An edge between two vertices, 
for example i and j, is called directed if the edge has an arrowhead: i <— j or z ^ j. 
An edge without arrowhead is an undirected edge: i — j- A graph in which all edges 
are directed is called a directed graph; and vice- versa, a graph in which no edge is 
directed is called an undirected graph. A graph which may contain both directed and 
undirected edges is called a partially directed graph. The underlying undirected graph 
of a (partially) directed graph G which we derive by removing all the arrowheads is 
called the skeleton of G. 

Two vertices i and j are adjacent if there is any kind of edge between them. The 
adjacency set of a vertex i, denoted by adj{i,G), is the set of all vertices that are 
adjacent to i in G. A path is a sequence of vertices {1, . . . ,k} such that i is adjacent to 
i + 1 for each i = l,...,A; — 1. A directed path is a path with directed edges that follows 
the direction of the arrows. When the first and last vertices coincide, the directed path 
is called a directed cycle. An acyclic graph is a graph that contains no directed cycles. 
A directed graph with no directed cycles is called directed acyclic graph (DAG). 
If i — > j, then i is called a parent of j and j is called a child of i. The set of parents of 
i in G is denoted as pa{i) and the set of children as ch(i). If there is a directed path 
from i to j, then i is called an ancestor of j and j is called an descendant of i. The 
set of ancestors of i is denoted as an(i), the set of descendants as de{i) and the set of 
non-descendants as nde{i). A v-structure in a graph G is an ordered triple of vertices 
{i,j,r), such that i ^ j and j <— r, and i and r are not adjacent in G. The vertex j is 
then called a collider. 

A path Q from i to j in a directed acyclic graph G is said to be blocked by S, if it 
contains a vertex v £ Q such that either (i) v £ S and v is no collider; or (ii) v ^ S nor 
has V any descendants in S, and u is a collider. A path that is not blocked by S is said 
to be active. Two subsets A and B are said to be d-separated by S if all paths from A 
to B are blocked by S. In other words, there is no active path from A to B. 

2.2 Graphical models and Markov properties 

Graphical models form a probabilistic tool to analyze and visualize conditional depen- 
dence between random variables, using some encoding with edges in the graph. Fun- 
damental to the idea of a graphical model is, based on graph theoretical concepts and 
algorithms, the notion of modularity where a complex system is built by combining sim- 
pler parts. One can distinguish between three main graphical models. Here we focus on 
DAG models, where all the edges of the graph are directed. According to [l^, a DAG 
model may exhibit several directed Markov properties. In the following, we present only 
two of them. 
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We use the following notation. Let P denote the distribution of {Xi, ...,Xp). For x G R^, 
we denote by xa = {xj; j £ A} for A C V = {1, . . . , p} and analogously for the random 
vector Xa- Furthermore, for disjoint subsets A,B and S, we denote by X^ -LL Xb\Xs 
conditional independence between Xa,Xb given Xs- 

Definition 2.1. [Directed global Markov property] 

Let A, B and S be disjoint subsets of V and G a DAG on V . If 

Xa -LL Xb\Xs 

whenever A and B are d-separated by S in the graph G, we say P obeys the directed 
global Markov property relative to the DAG G. 

Definition 2.2. [Recursive factorization property] 

We say that P admits a recursive factorization according to a DAG G whenever there 
exist non-negative functions fi{.\.) = 1, . . . such that 



j fi{Xi\Xpa(i))v{dXi) = 1 

and P has a density f with respect to the measure v, where 



f{Xi,...,Xp) — 

i=l 

If the density / of P is strictly positive, as for example in the case of a multivariate 
Gaussian distributiori, both Markov properties in Definitions 12.11 and 12.21 are equivalent. 



For more details see 18|, pp. 46-52]. 



A DAG model encodes conditional independence relationships via the notion of d- 
separation. Several DAGs could encode the same set of conditional independence 
relationships. These DAGs form an equivalence class, consisting of DAGs with the 
same skeleton and v-structures. A complete partially directed acyclic graph (CPDAG) 
uniquely describes such an equivalence class. In fact, directed edges in the CPDAG 
are common to all DAGs in the equivalence class. Undirected edges in the CPDAG 
correspond to edges that are directed one way in some DAGs and another way in other 
DAGs of the equivalence class. The absence of edges in the CPDAG means that all 
DAG members in the equivalence class have no corresponding edge. If all the con- 
ditional independence relationships of a distribution P and no additional conditional 
independence relations, can be inferred from the graph, we say that the distribution P 
is faithful to the DAG G. More precisely, if P is faithful to the DAG G: for any triple 
of disjoint sets A, B and S in V, 

Xa -LL Xb\Xs A and B are d-separated by 5* in G. 

Note that the directed global Markov property in Definition 1 2 . 1 1 implies the implication 
from the right- to the left-hand side; the other direction is due to the faithfulness 
assumption. 
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3 Covariance estimation based on DAGs 



Our methodology is based on two steps. We first infer the CPDAG, i.e. the equivalence 
class of DAGs, and we then estimate the covariance (concentration) matrix based on 
the CPDAG structure. 

We assume throughout the paper that the data are 

XM = (X;^),...,XM), r = l,...,n 

i.i.d. ~ P (1) 

with P being multivariate normal Mp{0, S), Markovian (as in Definition 12. II or 12. 2p and 
faithful to a DAG G. 



-^i I -^pa{i) 



is linear in ^^^(j) which will 



The Gaussian assumption implies that E 
be useful in the second estimation step for the concentration or covariance matrix. 
Moreover, it allows us to equate conditional independence with zero partial correlation 
which makes estimation for the CPDAG much easier. 



3.1 Estimating the covariance matrix from a DAG 

We first assume that the underlying DAG is given. Using the factorization property 
from Definition 12.21 in Section [2.21 we have: 

p 

f{Xi,...,Xp) = Y[f{Xi\Xpa(i)). 

1=1 

We use here and in the sequel the short-hand notation f{-\-) instead of fi{-\-)- For data 
as in ([1]), we can then write the likelihood function as 

n n p p n 

L = n f{xr\ ...^r ) = n n /(^ri^a,) = n n /(^f ^i^s,)- 

r=l r=l i=l i=l r=l 

Using the Gaussian assumption this leads to the likelihood in terms of the unknown 
parameter S (or S"-^ respectively). 



p 

i=l 



where n^paii) and ^i\pa{i) are the conditional expectation and variance of Xi given the 
parents Xpa(^iy Note that the conditional covariance is a fixed quantity whereas the 
conditional mean depends on the variables Xpa(^iy For a single random variable Xi we 
have: 



/^i|pa(i) 



E 



Xi X. 



pa{i) 



f^i ~^ '^i,pa{i){'^pa{i),pa(i)) (^pa(j) t^pa{i)) 



■'i,pa{i) \^pa{i),pa{i) ) 



X, 



(2) 



pa{i) 1 
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with assumption fii = 0\fi from above, and: 



-'i\pa{i) 



-'i,pa 



(3) 



The expressions '^i^pa(i) and Spa(j),pa(i) are sub-matrices formed by selecting the corre- 
sponding rows and columns from the full covariance matrix S. For example, Sj p^(j) is 
the sub-matrix (or vector) of S with row i and columns j € pa{i). The values Hi\pa{i) and 
^i|pa(i)) in the ith factor Li of the likelihood, are connected to regression, as described 
next. 



Consider for each node i a regression from Xi on Xp^^^-^ , where jXp^j-j) ~ AA(/x, 
We can represent these p regressions in matrix notation as follows: 



Xr, 



'i\pa{i) ) ^i|pa(i 



(4) 



where A is a p x p matrix corresponding to the regressions and e is the vector of the 
error terms. That is: 



-i^i,pa{i)i^pa{i),pa{i)) if j G pa(i) 

1 if j = i 

otherwise 



Now we can easily compute S or S , because we can write as 



{Xi ,Xp 



Hence, 



where 



S = Gov ((Xi, . . .,Xpf^ = Gov (^^^e) = ^"^ Gov (e) {A' 



1\T 



Gov (e) 



n|pa(l) 





^p\pa{p)^ 



(5) 



We then also easily obtain 

= {A-^ Gov (e) {A-Y)~^ = A^ Gov (e)"^ A. 
See 0, Ghap. 3] and [2l[ for more details. 

Since S and are based on the structure of the DAG G (via the matrix A) we write 
T^G and . An estimator is now constructed as follows. Consider the maximum 
likelihood estimator 



^MLE ^ ^-1 Y^^X^j) _ X)(X(^') - XY 



(6) 
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as an "initial" estimator Timu of S and use the following plug- in estimators: 

= A^Ckiv {ey^ A, (7) 

where A and Gov (e) are as in ^ but with the plug-in estimates ^f'pa^y {^pt(ifpa{i))~^ 
(for A) and '^f^pafi) (for Gov (e)) using formula ([3]). 

Note that the estimators in d?]) are automatically positive semi-definite having eigen- 
values > (and positive definite assuming Sj|pa(j) > for all j, which would fail only 
in very pathological cases). Furthermore, we could use another "initial" estimator than 
Y^MLE £qj. estimating Sj^pa(i), ^pa(i),i and ^pa(i),pa{i)- We are exploiting this possibil- 
ity for a robustified version, as discussed in Section [6l Finally, the estimator in ([7]) is 
implemented in the R-package ggm [6] . 

3.2 Inferring a directed acyclic graph 

The conditional dependencies between Xi,...,Xp and hence the DAG are usually not 
known. We use the PG-algorithm |27(] with estimated conditional dependencies to infer 
the corresponding GPDAG G, i.e. the equivalence class of DAGs (inferring the true 
DAG itself is well-known to be impossible due to identifiability problems). 

Estimation of the skeleton and partial orientation of edges are the two major parts of 
inferring a GPDAG. In the following we will describe these two steps. 



3.2.1 Estimating the CPDAG 

In a first step, we start from a complete undirected graph. When two variables Xi Xj 
are found to be conditional independent given X^ for some set K, the edge i — j is 
deleted: details are given in Algorithm [TJ In a second step, the edges are oriented using 
the conditioning sets K which made edges drop out in the first step: details are given 
in Algorithm [2j 

In the first step of the PG-algorithm, we need to estimate the conditional independence 
relations between Xi, ...,Xp. Under the Gaussian assumption conditional independen- 
cies can be inferred from partial correlations. Then, the conditional independence of 
Xi and Xj given Xk = {Xr;r £ K}, where K C {1, ...,p}\{i, j}, is equivalent to the 
following: the partial correlation of Xi and Xj given {Xr;r £ K}, denoted by Pij\Ki 
is equal to zero. This is an elementary property of the multivariate normal distribu- 



tion, see 18|, Prop. 5.2]. Hence to obtain estimates of conditional independencies we 
can use estimated partial correlations Pij\K- For testing whether an estimated partial 
correlation is zero or not, we apply Fisher's z-transform 



Zii,j\K) 




1 + Pij\K 
- Pi,j\K^ 



7 



Since Z{i,j \ K) has a AA(0, (n — \K\ — 3) ^) distribution if Pij\K = [ij], we have 
evidence that Pi.j\K 7^ if 

^n-\K\-?,\Z{i,j I K)\ > <^>-\l - |), 

where $ is the cumulative distribution function of the standard Normal distribution and 
the significance level < a < 1 is a tuning (threshold) parameter of the PC-algorithm 
described in Algorithms [1] and [21 



Algorithm 1: The PC-algorithm for the skeleton 



Input: z-transform of estimated partial correlations, tuning parameter a 
Output: Skeleton of CPDAG G, separation sets S (used later for directing the 
skeleton) 

1 Form the complete undirected graph G on the set {1, . . . ,p}; 

2 l = -l;G = G; 

3 repeat 
1 = 1 + 1; 
repeat 

Select an ordered pair of adjacent variables i, j in G such that 
\adj{i,G)\{j}\ >l; 
repeat 

Choose K C adj{i,G)\{j} with \K\ = I; 
if ^n- \K\ -3\Z{i,j \ K)\ < <^>-'^{l - a/2) then 
Delete edge i, j; 
Denote this new graph by G; 
Save K in S{i,j) and S{j,i); 

until edge i, j is deleted or all K C adj{i,G)\{j} with \K\ = I have been 
chosen ; 

until all ordered pairs of adjacent variables i and j, such that \adj{i,G)\{j}\ > I 
and K C adj{i^G)\{j} with \K\ = I, have been tested for conditional independence 

15 until for each ordered pair of adjacent nodes i,j: \adj{i,G)\{j}\ < I ; 



If Pij\K = is plausible, the edge i — j is deleted and K is saved in S{i,j). We call 
S = {S{i, j);i, j G {1, . . • i 7^ j} the separation sets. These sets are important for 
extending the estimated skeleton to a CPDAG as described below in Algorithm [2l 



23( 1 showed that the rules in Algorithm [2] are sufficient to orient all arrows in the 



CPDAG, see also [25|, pp.50]. The PC-algorithm, described in Algorithms [H and [21 
yields an estimate GcPDAG(a) of the true underlying CPDAG which depends on the 
tuning parameter a. 
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Algorithm 2: The PC-algorithm: extending the skeleton to a CPDAG 
Input: Skeleton G of CPDAG, separation sets S 
Output: CPDAG 

forall pairs of nonadjacent variables i, j with common neighbor k do 
if k ^ S{i,j) then 

|_ Replace i — k — j in Skeleton of G by z — > /c ^ j; 
repeat 

Rl Orient j — k into j ^ k whenever there is an arrow i ^ j such that i and k are 
nonadjacent; 

R2 Orient i — j into i ^ j whenever there is a chain i ^ k ^ j; 
R3 Orient i — j into i ^ j whenever there are two chains i ^ k ^ j and i ^ I ^ j 
such that k and / are nonadjacent; 
until no more orienting of undirected edges is possible by the rules Rl to R3 ; 



3.2.2 The PC-DAG covariance estimator 

Having an estimate GcpdagIc^) of the CPDAG, we pick any DAG G'dag(") in the 
equivalence class of the CPDAG. This can be done by directing undirected edges in the 
CPDAG at random without creating additional v-structures or cycles. The estimate for 
the covariance and concentration matrix is then: 

S'T-'^ , , as in formula (ITI), (8) 

Gdag(")' GDAG(a) 

and since the PC-algorithm for DAGs is involved, we call it the PC-DAG covariance 
estimator. Its only tuning parameter is a used in the PC-algorithm. As described in 
Section 13.2. H it has the interpretation of a significance level for a single test whether a 
partial correlation is zero or not. The choice of this tuning parameter a can be done 
using cross-validation of the negative out-of-sample log-likelihood. 

We remark that the zeros in T,j} , , are the same for any choice of a DAG in the 

GDAG(a) 

estimated CPDAG GcPDAcCa)- However, the non-zero estimated elements of the es- 
timated matrices will be slightly different. To avoid an unusual random realization 
when selecting a DAG from G'cpdag(")) we can sample many DAGs and average the 
corresponding estimates for or S. 

In some cases, we need some small modifications of the PC-DAG covariance estimator 
which are described in Appendix[Bl Estimation of a CPDAG as described in Algorithm 
[1] and [2] is efficiently implemented in the R-package pcalg, as described in its reference 
manual [17i ]. 



4 Consistency 

We prove asymptotic consistency of the estimation method in high-dimensional settings 
where the number of variables p can be much larger than the sample size n. In such 
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a framework, the model depends on n and this is reflected notationally by using the 
subscript n. We assume: 



(A) The data is as in ([T]) with distribution P„ of {Xi, Xp^) being multivariate 
normal AA(0, Markovian as in Definition 12. II or 12.21 and faithful to a DAG G„. 

(B) The variances satisfy: Var (Xi) = cr^.j < cr^ < oo for all i = 1, .. . ,Pn- 

(C) The dimension pn = 0{n"') for some < a < oo. 

(D) The maximal cardinality g„ = maxj=i^...^p,^ \o-dj{i, of the adjacency sets in Gn 
satisfies g„ = 0{n^~^) for some < 6 < 1/2. 

(E) For any i^j £ l,...,p„, let Pn;i,j\s denote the partial correlation between X^ and 
Xj given S, where S G {1, . . . ,pn} \ These partial correlations are bounded 
above and below: 



for some M < 1, and 



inf ( 



sup 



Pn;i,j\S 



Pn;i,j\S 



< M 



; Pn;i,j\S 



with = 0{n'^) for some < d < 1/4 + 6/2, where h is as in (D). 

(F) For every DAG in the equivalence class of the true underlying CPDAG (induced 
by the distribution in assumption (A)), the conditional variances satisfy the fol- 
lowing bound: 



inf Var [Xj \ Xpa{i)\j j > r > 0, 
inf Var (Xi | Xp^d)) > r > 0. 



Assumption (C) allows the number of variables Pn to grow as an arbitrary polynomial in 
the sample size and reflects the high-dimensional setting. Assumption (D) is a sparseness 
assumption, requiring that the maximal number of neighbors per node grows at a slower 
rate than 0(n2 ). Assumption (F) is a regularity condition on the conditional variances. 
Assumption (E), in particular the second part, is a restriction which corresponds to the 
detectability of non-zero partial correlations: obviously, we cannot consistently detect 
non-zero partial correlations of smaller order than For sparse graphs with h close 

to 1/2 in (D), the value d close to 1/2 is allowed, i.e. close to the l/\Ai detection limit. 
Under assumptions (A)-(E), the PC-algorithm was shown to be consistent for inferring 
the true underlying CPDAG [H, Th.2]. More precisely, we denote by GcPDAGinC") the 
estimate for the underlying CPDAG, using the PC-algorithm with tuning parameter a 
(Algorithms [H and [2]) , and by GcPDAG;n the true underlying CPDAG. Then, assuming 
(A)-(E) and for a„ = 2(1 - $(ni/2c„/2)): 

P[GcVDAG;n{an) = G'cPDAG;n] ^ 1 (n ^ Oo). (9) 
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Concerning the consistency of DAG based estimation of the concentration matrix, we 
have the following new result. 

Lemma 4.1. Under assumptions (A)-(D) and (F) the following holds. For any DAG 
G in the equivalence class of the true underlying GPDA G and using the estimator 
m ^: 



sup 



-1 



''G,n;i,j n;i,j 



(n — > do). 



A proof is given in the Appendix. We then obtain the main theoretical result. 

Theorem 4.1. Under assumptions (A)-(F) and using the tuning parameter an = 2(1 - 
$(n^/^Cn/2)) in the PC-algorithm, the following holds for the estimator in 



sup 



GoAG(a),n;i,j 



(n — > cxd). 



Proof: The estimate GDAG;n(a) is a DAG element of the estimated equivalence class 
encoded by the estimated CPDAG GcPDAG]n{oi)- Denote this DAG by G^. Consider 
the event 

An = {GcPDAG;n{oin) = GcPDAG;n}, 

whose probability P [An] — > 1 (n — > oo), see ([9]). On An, G* must be a DAG element of 
the true equivalence class GcPDAG;n and hence on An, Lemma [4T] yields consistency: 



sup 



sup 



-1 



-1 



G,n;i,j 

Since ^[An] — > 1, the proof is complete 



-'G*,n;i,j n;i,j 



(n ^ oo). 



□ 



5 Simulation and real data analysis 

We examine the behavior of our PC-DAG estimator using simulated and real data and 
compare it to the Glasso method 

Hi- 

The Glasso is defined as: 



^gLso= argmin (- log det S'^ + tr (S^^-^S-i) + A||S-i ||i) (10) 
s-i non-neg. def. 

where Y,^^^ is the empirical covariance matrix in Q, = J2i<j l^jj'^l 

minimization is over non-negative definite matrices. 



All computations are done with the R-packages pcalg 17[ and glasso. 



5.1 Simulation study 

We consider a DAG and a non-DAG model for generating the data. 
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5.1.1 DAG models 



We focus on the following class of DAG models. We generate recursively 

Ai = ei ~AA(0,1), 

Xi = Y^ BirXr + ei {i = 2,...,p), 

r=l 

where ei, . . . , ep i.i.d. ~ AA(0, 1) and B is an adjacency matrix generated as follows. We 
first fill the matrix B with zeros and replace every matrix entry in the lower triangle 
by independent realizations of Bernoulli(s) random variables with success probability 
s where < s < 1. Afterwards, we replace each entry having a 1 in the matrix B by 
independent realizations of a Uniform([0.1,l]) random variable. If i < j and Bji ^ the 
corresponding DAG has a directed edge from node i to node j. The variables Ai, . . . , Xp 
have a multivariate Gaussian distribution with mean zero and covariance T, which can 
be computed from B. We consider this model for different settings of n, s and p: 



Dl: 


n = 


30, 


s = 


0.01, 


P = 


40,50,60,70,80,90, 


100, 


110, 


120 


D2: 


n = 


50, 


s = 


0.01, 


P = 


40,50,60,70,80,90, 


100, 


110, 


120 


D3: 


n = 


30, 


s = 


0.05, 


P = 


40,50,60,70,80,90, 


100, 


110, 


120 


D4: 


n = 


50, 


s = 


0.05, 


P = 


40,50,60,70,80,90, 


100, 


110, 


120 



The settings Dl to D4 mainly differ in the sparsity s of the generated data, which is 
related to the expected neighborhood size E[adj{i,G)] = s{p — 1) for all i. For each 
of these settings we estimate the covariance and the concentration matrix with both 
methods, our PC-DAG and the Glasso estimator. 

We use two different performance measures to compare the two estimation techniques. 
First, the Frobenius norm of the difference between the estimated and the true matrix 
||S — SlIiT' and — S"^!!^?. And second, the Kullback-Leibler Loss AklC^^^ , ^^^) = 
tr(SS-i) - log|SS-i| - p. 

We sample data A^^^ , • • • , A^") i.i.d. from the DAG model described above for each value 
of p in settings D1-D4. Then we derive, on a separate validation data-set A^^^ ; • • • j A^"^ , 
the optimal value of the tuning parameters a (PC-DAG) or A (Glasso), with respect to 
the negative Gaussian log-likelihood. The two different performance measures are eval- 
uated for the estimates based on the training data A*^^), . . . , A*^") with optimal tuning 
parameter choice based on the validation data. All results are based on 50 independent 
simulation runs. 

Figures [U and [2] show that in the sparse settings Dl and D2, the PC-DAG estimator 
clearly outperforms Glasso. Concerning the more dense settings D3 and D4, the PC- 
DAG method degrades only for the covariance matrix, whereas for the inverse covariance 
matrix the figures still show an improvement of the PC-DAG estimator compared 
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to the Glasso. If we match Figured] (a) with Figured] (b) and Figure [2] (a) with Figure [2] 
(b), we see that for a small increase of the sample size the Glasso improves substantially 
less compared to the PC-DAG estimator. The results in terms of the Kullback-Leibler 
loss are summarized in Tabled) 

5.1.2 Non DAG models 

Next we generate data from a non- DAG model proposed by [j^ . The concentration 
matrix equals 

= 5 + 61, 

where each off-diagonal entry in B is generated independently and equals 0.5 with 
probability tt or with probability 1 — vr, all diagonal entries of B are zero, and 5 is 
chosen such that the condition number of is p. The concentration matrices, which 
we generate from this model vary in their level of sparsity: for we take vr = 0.1 

and for we choose vr = 0.5, i.e. S^^^ is sparser than S^^^. Note that the expected 

numbers of non-zero entries in SZ,^, and are proportional to p^. 
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AO 60 80 100 120 10 60 80 100 120 

P P 



(a) For setting Dl (b) For setting D2 




(c) For setting D3 (d) For setting D4 



Figure 2: Plots of ||S ^ — S ^\\p for DAG models. Vertical bars indicate (pointwise) 
95% confidence intervals. 

We generate Gaussian data X^^\ . . . , X^^^ i.i.d. ~ A/'p(0,S) with Y,^^ constructed as 
above, according to the following settings: 



nDl: 


n = 


30, 


TT 


= 0.1, 


P = 


40, 50, 60, 70, 80, 90, 


100, 


110, 


120 


nD2: 


n = 


50, 


TT 


= 0.1, 


P = 


40, 50, 60, 70, 80, 90, 


100, 


110, 


120 


nD3: 


n = 


30, 


IT 


= 0.5, 


P = 


40, 50, 60, 70, 80, 90, 


100, 


110, 


120 


nD4: 


n = 


50, 


TT 


= 0.5, 


P = 


40, 50, 60, 70, 80, 90, 


100, 


110, 


120 



We tune and compare the estimation methods as described in Section [5. 1.1[ 

In Figures [3] and m we see that in case of the dense model with vr = 0.5, the two methods 
do not differ much (some of the differences are so small that they are invisible on the 
scales shown in the plots). But for the sparse model with tt = 0.1 we observe that our 
PC-DAG estimator is better than the Glasso, in particular for the setting nD2. The 
results in terms of the Kullback-Leibler loss are summarized in Table [TJ 
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Kullback-Leibler Loss 


DAG 


n = 30 


P 


s = 0.01 (Dl) 


s = 0.05 (D3) 




Glasso 


PC-DAG 


Glasso 


PC-DAG 


40 


3.78(0.17) 


3.38(0.16) 


13.64(0.41) 


9.27(0.29) 


80 


12.75(0.34) 


11.36(0.29) 


54.63(0.9) 


41.69(0.67) 


120 


25.5(0.41) 


22.93(0.42) 


79.34(1.35) 


104.43(1.47) 


DAG 


n = 50 


P 


s = 0.01 (D2) 


s = 0.05 (D4) 




Glasso 


PC-DAG 


Glasso 


PC-DAG 


40 


3.12(0.15) 


1.88(0.08) 


13.3(0.31) 


6.26(0.18) 


80 


11.07(0.26) 


6.32(0.17) 


53.08(1.22) 


31.83(0.53) 


120 


24.35(0.47) 


13.76(0.27) 


66.21(2.78) 


87.11(0.93) 


non DAG 


n = 30 


P 


Model (nDl) 


Model S^2) (nD3) 




Glasso 


PC-DAG 


Glasso 


PC-DAG 


40 


15.61(0.21) 


14.91(0.22) 


13.53(0.16) 


13.71(0.16) 


80 


35.63(0.45) 


35.9(0.49) 


29.36(0.33) 


29.49(0.33) 


120 


56.44(0.67) 


56.88(0.7) 


45.34(0.45) 


45.76(0.46) 


non DAG 


n = 50 


P 


Model S^^5 (nD2) 


Model (nD4) 




Glasso 


PC-DAG 


Glasso 


PC-DAG 


40 


15.38(0.24) 


10.58(0.18) 


12.76(0.16) 


12.91(0.17) 


80 


34.28(0.4) 


32.13(0.3) 


27.49(0.28) 


27.68(0.34) 


120 


53.69(0.7) 


53.16(0.67) 


42.85(0.5) 


43.08(0.5) 



Table 1: Kullback-Leibler Loss (standard error in parentheses). 
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Figure 3: Plots of ||S — T,\\f for non DAG models. Vertical bars indicate (pointwise) 
95% confidence intervals. 



5.2 Real data 

In this section we compare the two estimation methods for real data. 



5.2.1 Isoprenoid gene pathways in Arabidopsis thaliana 

We analyze the gene expression data from the isoprenoid biosynthesis pathway in Ara- 
bidopsis thaliana given in [i^l . Isoprenoids comprehend the most diverse class of natural 
products and have been identified in many different organisms. In plants isoprenoids 
play important roles in a variety of processes such as photosynthesis, respiration, regu- 
lation of growth and development. 

This data set consists of p = 39 isoprenoid genes for which we have n = 118 gene ex- 
pression patterns under various experimental conditions. As performance measure we 
use the 10-fold cross-validated negative Gaussian log-likelihood for centered data. 

The results are described in Figure [5j We find that none of the two methods performs 
substantially better than the other and the slight superiority of Glasso is in the order 
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PC-DflG 

Glasso 



100 120 



100 120 



(a) For setting nDl 



(b) For setting nD2 





100 120 



(c) For setting nD3 



(d) For setting nD4 



Figure 4: Plots of ||S ^ — Yj for non DAG models. Vertical bars indicate (pointwise) 
% confidence intervals. 



of 1% only. The marginal difference in the negative log-likelihood between the two 
estimation techniques may be due to the high noise in the data. 



5.2.2 Breast Cancer data 

Next, we explore the performance on a gene expression data set from breast tumor 
samples. The tumor samples were selected from the Duke Breast Cancer SPORE tissue 
bank on the basis of several criteria. For more details on the data set see |29| . The data 
matrix monitors p = 7129 genes in n = 49 breast tumor samples. We only use the 100 
variables having the largest sample variance. 

As before we first center the data and then compute the negative log-likelihood via 
10-fold cross-validation. Figure [U] shows the result. 

As for the Isoprenoid gene pathways data-set, we cannot nominate a winner here. In 
fact, the performances are even more indistinct than before. 
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6.0 6.5 
log-(LO norm) 



Figure 5: 10-fold CV of negative log-likelihood against the logarithm of the average 
number of non-zero entries of the estimated concentration matrix S^^. The squares 
stand for the Glasso and the circles for the PC-DAG estimator. 

6 A robust PC-DAG covariance estimator 



In this section we propose a robust version of the PC-DAG estimator. According to 
Section [31 we need an initial covariance matrix estimation T,init in order to run the PC- 
DAG technique. In Section [Sj we used the sample covariance Sjmt = ^mle from ([6]). 
It is well known that the standard sample covariance estimator is not robust against 
outliers or non-Gaussian distributions. 

In order to get a robust version of the PC-DAG method we start with a robust estimate 
of S. We propose to use the orthogonalized Gnanadesikan-Kettenring (OGK) estimator 
presented by [23 |. Employing the OGK estimator in the PC-algorithm, i.e. estimating 
partial correlations from the OGK covariance estimate, we obtain a robustified estimate 
of the CPDAG, see also |l6l |, and finally a robust PC-DAG covariance estimate as in 
d?]) and ([8]) by using again the OGK covariance estimator instead of Tjmle- 



An "ad-hoc" robustification of the Glasso method can be achieved by using in (jlOp the 
robust OGK covariance estimate instead of the sample covariance Timle- 



6.1 Simulation study for non-Gaussian data 

In order to analyze the behavior of the robust PC-DAG method we use a simulation 
model as in Section 15.1.11 but with different distributions for the errors e. Regarding 
the latter, we consider the following distributions: AA(0, 1), 0.9AA(0, 1) +0.1*3(0,1) or 
0.97^(0, 1) + 0.1Cauchy(0, 1). 

We compare the standard PC-DAG, robust PC-DAG, standard Glasso and the robust 
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m - 

CM 



5.0 5.5 6.0 6.5 

log-(LO norm) 

Figure 6: 10-fold CV of negative log-likelihood against the logarithm of the average 
number of non-zero entries of the estimated concentration matrix. The squares stand 
for the Glasso and the circles for the PC-DAG estimator. 

Glasso estimators for Gaussian, 10% is contaminated Gaussian and 10% Cauchy con- 
taminated Gaussian data for one specific parameter setting: 

R : n = 50, p = 80, s = 0.01 

In order to compare the four methods we use the Kullback-Leibler loss defined in Section 
15.1.11 For the four estimation methods we plot the Kullback-Leibler loss against the 
logarithm of the average number of non-zero entries of the estimated concentration 
matrix S"^. The dotted vertical line represents the average number of non-zero entries 
of the true underlying concentration matrices. All the results are again based on 50 
independent simulation runs. 

Figures [7] (a) and [7] (b) show that without or with moderate outliers, the standard and 
robust PC-DAG estimators perform about as well as the standard and robust Glasso: 
the claim is based on the observation that the minimum Kullback-Leibler loss of each 
of the four methods is about the same, although the corresponding sparsity of the fitted 
concentration matrix may be very different. In the presence of more severe outliers, 
the robust PC-DAG technique is best as can be seen from Figure [7] (c). In summary, 
the robust PC-DAG estimator is a useful addition to gain robustness for estimating a 
high-dimensional concentration matrix. 

7 Summary and Discussion 

We have introduced the PC-DAG estimator, a graphical model based technique for es- 
timating sparse covariance matrices and their inverses from high-dimensional data. The 



19 




(a) Gaussian data (b) 10% ts contaminated Gaussian data 




PC-DAG-rob 

Glasso-rob 

- - PC-DAG 



(c) 10% Cauchy contaminated Gaussian 
data 

Figure 7: Kullback-Leibler loss against the logarithm of the average number of non-zero 
elements of for Gaussian data (a), 10% ts contaminated Gaussian data (b) and 
10% Cauchy contaminated Gaussian data (c). 



method is based on very different methodological concepts than shrinkage estimators. 
Our PC-DAG procedure is invariant to variable permutation, yields a positive definite 
estimate of the covariance and concentration matrix, and we have proven asymptotic 
consistency for sparse high-dimensional settings. An implementation of the estimator is 
based on the R-package pcalg [l7|. We remark that alternatively, one could construct 
a high-dimensional covariance estimate based on a sparse undirected conditional inde- 
pendence graph which itself can be inferred from data using e.g. the node-wise Lasso 
procedure from [24]. 

We have compared our PC-DAG estimator with the Glasso [l^, 0] in two simulation 
models. For the concentration matrix, our PC-DAG approach clearly outperforms the 
Glasso technique for some parameter settings, with performance gains up to 30-50%, 
while it keeps up with Glasso for the rest of the considered scenarios. For estimation of 
covariances, the conclusions are similar but slightly less pronounced than for inferring 
concentration matrices. Furthermore, we have compared the two methods in two real 
data-sets and found only marginal differences in performance. If the data generating 
mechanism is well approximated by a DAG-model, the PC-DAG estimator is undoubt- 
edly better than the shrinkage-based Glasso. However, it is very hard to know a-priori 
how well a DAG-model describes the underlying true distribution. Finally, we have 
presented a robustification of our PC-DAG estimator for cases where the Gaussian data 
is contaminated by outliers. 
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Appendix 



A Proof of Lemma 14.11 

A key element of the proofs is the analysis of low-order regression problems described 
in Section [3.11 For a DAG-structure with sets of parents, we consider regressions of the 
form 

j£pa{i) 

and Ei independent of Xpa(j) . The corresponding OLS estimates based on n i.i.d. samples 
X(^), . . . , as in ([I]) are denoted by 



Lemma A.l. Suppose that the Gaussian assumption in (A), assumptions (B) and (F) 
hold. Then, for every e > 0, 







e 


sup 


/3f - ff 


> — 


_«=!,.. .,p„j6pa(i) 





^ ^1 2 

< —QnPn exp 



Qn 



,n 



1) ) +2exp(-C3(--<7„ 



r 



(11) 



n ^ 2(g„ + C4) where Ci, C2 > are constants depending on cr^ and r (see Assumptions 
(B) and (F)), and €3,04 > are absolute constants. 



Proof. The proof is analogous to Lemma 7.1 in [20|]. For completeness, we give a detailed 
derivation. The union bound yields 



sup 

i=l,...,p„,j&pa{i) 



e 








e ' 


> — 


< PnQn sup P 




/3f - pf 


> — 


Qn 






Qn. 



:i2) 



Next we analyze supj j P fSf'' — fSf' 



> e 



for a general e > 0. 



Let i G {l,...,pn} and denote by s{i,j) = pa{i)\j. We consider first the conditional 
distribution of pf^\Xpa{i) ■ The variance of Xi\Xpg_(£^ is '^'j\pa{i) ^'^'^ 'we denote the variance 
of Xj\Xg(j^^j'^ by '^'^j\s(i jy Further, we denote the sample variance of Xj by (t|, the sample 
variance of by and the sample multivariate correlation coefficient 

between Xj and -^s(i,j) by R'j\s(i j)- Then, when conditioning on ^pa[i) = {Xrj] f = 
1, . . . ,n, i G pa{i)], 



Var (pf I X, 



pa{i] 



^j|pa(i) 



^i\pa{i) 



i-Ri 



(^-i^(^>j)i-i)<^,W) 



(13) 
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where the first equahty follows from e.g. [ll], p. 120] and the second equality follows from 



assumption in (A), we get 



. With ([13]), E 13 f I Xpa(i) = f3f' and the Gaussian 



P 
P 



|/?f -/3f I >6|Xp,(,) 



\Z\> 



l^Jn - |s(i, j)| - l<5-j|s(ij) 



X 



(J, 



i\pa{i) 



pa{i) 



(14) 



where Z is a standard normal random variable. 



We first analyze on the set Bj^^^^j^ = {^'j\s(i j) > From assumption (F) 



and Var ( Xi \ ^pa(j) ) ^ it follows that 



inf 



Var (. 



I ^pa(i)\j 



i=i,...,P„,jepa(i) Var (x^ | Xp^ii) 
where v > 0. Using this bound from (1150 we obtain 



> 



(15) 



\Z\> 



l^n - |s(i, j)| - l^j>(jj) 



X 



CJ, 



i|pa(i) 



pa(i) 



< p 

< p 



v/n - \s{i,j)\ - 1 



V2 



|Z| > ev 



\Z\>CeJn-\qn\-l 



(16) 



where C depends on v in (jl5p . We then bound the tail probability of the standard 
normal distribution by P[\Z\ > a] < exp (^f-) for a > 0. Hence, (fT6|) can be 
further bounded by 



— exp (-C2e^(n - g„ - 1)) 



(17) 



for all n such that qn < n — 2, where Ci, C2 > are constants depending on v in (|15p. 
i.e. they depend on cr^ and r in assumptions (B) and (F). 



Next, we compute a bound for P 



. Note that 



B 



c 



•js{i,j) I ^s{t,j) 



x^ 



(n - \s{i,j)\ - l)<T^-|,(i,,-) ^ (n - |s(i, j)| - 1) 



a 



j\s{i,j) 



= p 
< p 



2 ^ (n - |g(^,j)| - 1) 



^n— (Jn — 1 



< 



n — 1" 
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Now we apply Bernstein's inequality 28|, Lemma 2.2.11] by writing 



n-q„-l 



< 



n — 1 



= P 
< P 



2 ~ 1) 
Xn-qn-l -{n-qn-l) < 7^ + 



IX. 



^l- {n-Qn- 1)1 < 



2 

(n-1) 



and noting that Xn~q„-i ~{^~Qn — ^) can be viewed as the sum of n — — 1 independent 
centered Xi random variables. Hence, the last term is bounded above by 



2exp 



n-l 



C3 + C'^{^^ — Qn) J 

where C3, C4 > are constants arising from moment conditions. This expression is in 
addition bounded above by 



,n 



2exp(-C3(--g„-l)) 



(18) 



for all n such that — g„ > C3, and C3 > is a constant arising from moment 
conditions. Because this bound in (fT8|) holds for all ^s(jj) with |s(i,j)| < g„, it also 

holds for the unconditional probability P 



The upper bound for P - pf\ > e now follows by combining (jl7p and (jlSp : 



P 

< 



> e 



B 



> e I pa{i] 



dFx. +P 



< exp i-C2€\n -Qn- 1)) + 2exp {-Csi- - Qn - 1)). 



Now by using e = — we derive 



supP 

id 



> 

-2 



< 



6 Tt 

■exp (-C2^(n - qn- 1)) + 2exp [-C^{- - g„ - 1)) 



(19) 



which holds for all n > 2(g„ + C3) + 2 = 2(g„ + C4). Combining ([19]) with ([12]) we 
complete the proof of Lemma lA.li 

□ 



Lemma A. 2. Suppose that the Gaussian distribution in assumption (A), assumptions 
(B) and (F) hold. Then, for every e > 0, 



sup 

l<i<p„ 



1 



^i\pa{i) 



< pn2 exp 



'^j|pa(j) 

e^(n - 



e 

> — 



+ exp 



r^(n 



24cj4 + 8rf72 
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where C > is an absolute constant and r > as in assumption (F). 



Proof. Using the union bound, for e > 0, 



sup 
1=1,. ..,p„ 



-2 2 



> e 



< Pn sup P 
i=l,...,p„ 



-2 _ 2 

^i\pa(i) '-^i\pa{i) 



> e 



For the conditional probabihty, when conditioning on ^pa{i) = i^rj', r = 1, . . . , n, j G 



pa{i)}, we have that P 



'~^i\pa(i) ^i\pa{i) 



> e I X, 



-pa{i) 



is equal to 



^i\pa{i) 



i\pa(i) 



> 



X 



i\pa{i) 



(n- |pa(i)|)a.|p^(.) 



^i\pa{i) 



■pa{i) 



{n - \pa{i)\) 



> 



e(n — |pa(z)|) 



X 



i\pa{i) 



■pa{i) 



Because "'""T'"'"""'' -(n-|pa(i)|) is a sum of {n — \pa{i)\) independent Xi-distributed 

i\pa{i) 

centered random variables, we can use Bernstein's inequality \2S., Lemma 2.2.11]. Hence, 
with < we get 



^i\pa{i) 



(n - \pa{{)\) 



> 



e(n - \pa{i)\) 



X 



■pa{i) 



< 2 exp 



e^(n — |pa(i)|) 



Since this bound holds for all Xp(i(j), the bound also applies to the unconditional prob- 
ability: 



-2 _ 2 

^i\pa(i) ^i\pa{i) 



> e 



(n- |pa(i)|)<T^^p^(.) 



(n - \pa{i)\) 



i\pa{i) 



> 



e(n - \pa{i)\) 



< 2 exp 



e^(n — |pa(i)|) 



(20) 



We use now a Taylor expansion: 

1 1 



* 2 2 ~ 4 

'^i\pa{i) '~^i\pa(i) '~^i\pa{i) 



i\pa{i) ^i|pa(i)/' 



where 



^i\pa{i) ^i\pa(i) 



< 



^i\pa(i) ^i\pa{i) 



Consider the set B = {supj^^^ 



^i\pa{i) ^i\pa{i) 



< r/2} with r > as in assumption 
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(F). Then, on i?, we have 
i). Therefore, 



< C < oo (and the bound does not depend on the index 



sup 

i=l,...,p„ 



^i\pa{i) '^i\pa(i) 



> 



< P 



C sup 

i=l,...,Pn 



-2 2 
'^i\pa{i) ~ ^i|pa(i) 



> — \nB 



+ P 



B 



c 



The first term and second term on the right-hand side can be bounded using (j20p , leading 
to the bound in the statement of the lemma. This completes the proof of Lemma IA.2[ 

□ 



Proof of Lemma \4-1\ Let G be a DAG from the true underlying CPDAG, i.e. the true 
equivalence class. Using the union bound we have 



sup 



^-1 



-1 



''G,n;i,j 



> 7 



< p„ sup P 



-1 



^-1 



G,n]i,j n\id 



> 7 



(21) 



Since ^ = A^Cov (e) ^ ^ we have ^q^.^ j = Z]fc=i ^fc^fcj^fci with Xk = and A 
in dZD. Thus, 



as 



y-l _V-1 

G,n;i,j n;i,j 



(XkAkjAki — XkAkjAki^ 

k=l 

Pn 
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Pn 
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By plugging these bounds into the formula above and using that the summations are 
over at most g„ terms only (due to sparsity of A^i and A^j), we obtain 



^-1 

^G,n;i,j 



< CqnS 



where C > is an absolute constant and 6 the maximal absolute difference of A's and 
A's: 



6 = maxjmax | — | , max | — | } . 

i,k k 



Hence 



y-i _y-i 

G,n;i,j n;i,j 



> 7 



<P[Cqn\6\>^]=F 



\6\> 



7 
Cq„ 



\s\> 



with ^ = e. Because the convergence of the term P \5\ > is covered either by 
Lemma lA.ll or Lemma \A.2\ since ql = 0(n^-2b) (0 < 6 < 1/2) from assumption (D), 

□ 



and using (121[) , we complete the proof of Lemma 14.1 



B Modifications of the PC-DAG covariance estimator 



With finite sample size, the PC-algorithm may make some errors. One of them can 
produce conflicting v-structures when orienting the graph: if so, we deal with it by 
keeping one and discarding other v-structures. In our implementation, the result then 
depends on the order of the performed independence tests. Furthermore, it may happen 
that the output of the PC-algorithm is an invalid CPDAG which does not describe an 
equivalence class of DAGs. In such a case we use the retry type orientation procedure 
implemented in the pcAlgo-function of the pcalg-package, see the reference manual of 
the pcalg-package [17| for more information. 
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